Abstract: The QR Decomposition (QRD) of communication
channel matrices is a fundamental prerequisite to several
detection schemes in Multiple-Input Multiple-Output (MIMO)
communication systems. The main feature of the QRD
is to transform the non-causal system into a causal system,
in which efficient detection algorithms based on
Successive Interference Cancellation (SIC) or the Sphere Decoder
(SD) become possible. The QRD can also be used as a light but
efficient antenna selection scheme. In this paper, we study
QRD methods and compare their efficiency in
terms of computational complexity and error rate performance.
Moreover, particular attention is paid to the parallelism of
the QRD algorithms, since it reduces the latency of the matrix
factorization.

Index Terms: MIMO detection, QR decomposition, parallelism.

I. Introduction 

Multiple-Input Multiple-Output (MIMO) technologies have
gained increasing attention in recent standardization
bodies due to their ability to improve the reliability of the
communication link and boost the channel capacity without
requiring additional spectral resources [1]. In 3GPP Long-Term
Evolution (3GPP LTE) and 3GPP LTE-Advanced, MIMO
techniques with up to 8 transmit antennas are combined with
Orthogonal Frequency Division Multiplexing (OFDM) to
offer both low-complexity equipment and robustness to
frequency-selective fading [2].

The main challenge on the receiver side is to introduce
low-complexity but efficient detection algorithms able to
recover the transmitted signals quasi-optimally. QR Decomposition
(QRD)-based Successive Interference Cancellation (SIC)
schemes [3], the Sphere Decoder (SD) [4] and the QRD with
M-algorithm (QRD-M) [5] detection algorithms use the QRD
to transform the MIMO system from a non-causal system, where
interference from future symbols exists, into a causal system
where the detection of the current symbol only depends on
the already-detected symbols. Hence, SIC becomes possible.
The Lenstra-Lenstra-Lovasz (LLL) [6] lattice basis reduction
algorithm also requires the QRD to decrease the computational
complexity of the calculation of its orthogonality deficit measure.

The main idea of the QRD is to factorize the channel matrix
H as the product of a unitary matrix Q and an upper triangular
matrix R. Since the matrix Q is unitary, H and R have the same
Euclidean norm and the same singular values and, when H is
square, determinants of equal magnitude.
Thanks to the QRD, the MIMO system is transformed into a
causal system by simply filtering the received vector using the
Hermitian transpose of the Q matrix.
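The filtering step can be written out explicitly. The short derivation below is a sketch that assumes the received-vector model y = Hx + n of Section II and uses the shorthand $\tilde{\mathbf{n}} = \mathbf{Q}^{H}\mathbf{n}$, which is introduced here for illustration only:

```latex
\begin{align}
\tilde{\mathbf{y}} &= \mathbf{Q}^{H}\mathbf{y}
  = \mathbf{Q}^{H}\left(\mathbf{Q}\mathbf{R}\mathbf{x} + \mathbf{n}\right)
  = \mathbf{R}\mathbf{x} + \tilde{\mathbf{n}}, \\
\tilde{y}_i &= R_{i,i}\,x_i + \sum_{j=i+1}^{n_T} R_{i,j}\,x_j + \tilde{n}_i ,
  \qquad i = 1, \dots, n_T .
\end{align}
```

Because R is upper triangular, the i-th filtered observation only involves the symbols of index i and above, which is what makes a layer-by-layer detection starting from the last layer possible. Since Q has orthonormal columns, the statistics of white Gaussian noise are unchanged by the filtering.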

It turns out that the QRD, although it plays an important
role, has not been sufficiently studied in the wireless
communication literature.

Contributions. The contributions of this paper can be 
summarized as follows: 

• First, several computation methods of the QRD are intro- 
duced. The parallelism capabilities in calculation of the 
QRD are emphasized; 

• A reduced complexity solution is proposed and discussed;

• The complexity of the introduced algorithms is calculated 
in terms of real multiplications (MUL), with realistic 
implementation considerations. 

The rest of this paper is organized as follows. In Section II, 
we introduce the system model and review several detection 
schemes that are based on the QRD. QRD techniques are
introduced in detail in Section III and simulation results are
presented in Section IV. Section V provides a conclusion and a
discussion.

II. System Model and MIMO Detection 

A. System Model 

In this paper, we consider a MIMO-Spatial Multiplexing
(MIMO-SM) system, where independent data symbols are
transmitted via sufficiently-separated antennas. By considering
$n_T$ transmit antennas and $n_R$ receive antennas, with $n_R \geq n_T$
to avoid rank-deficiency effects [7], the received vector
$\mathbf{y} \in \mathbb{C}^{n_R}$ is given by:

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}, \qquad (1)$$

where $\mathbf{x} \in \xi^{n_T}$ is the transmitted vector with
$E[\mathbf{x}\mathbf{x}^H] = \mathbf{I}_{n_T}$ ($\xi$ is the modulation set and
$\mathbf{I}_{n_T}$ is the $n_T \times n_T$ identity
matrix) and H is the channel matrix with complex elements $h_{i,j}$
whose real and imaginary parts are independent and follow
$\mathcal{N}(0, 0.5)$. Also, $\mathbf{n}$ is the additive white Gaussian noise vector.

B. MIMO Detection

In MIMO-SM, the optimum Maximum-Likelihood Detector
(MLD) employs a brute-force search to estimate the
transmitted vector:

$$\hat{\mathbf{x}}_{ML} = \arg\min_{\mathbf{x} \in \xi^{n_T}} \|\mathbf{y} - \mathbf{H}\mathbf{x}\|^2. \qquad (2)$$
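To make the cost of the brute-force search in (2) concrete, the following is a minimal Python sketch. It is an illustration and not the authors' implementation: the names `QPSK`, `matvec` and `ml_detect` are hypothetical, the constellation is assumed to be unit-energy QPSK, and the example is noise-free.

```python
from itertools import product

# Unit-energy QPSK constellation (assumption made for this example).
QPSK = [complex(a, b) / 2 ** 0.5 for a in (1, -1) for b in (1, -1)]

def matvec(H, x):
    """Multiply matrix H (list of rows) by vector x."""
    return [sum(h * xj for h, xj in zip(row, x)) for row in H]

def ml_detect(H, y, constellation=QPSK):
    """Exhaustive search for the x that minimises ||y - Hx||^2."""
    n_t = len(H[0])
    best, best_metric = None, float("inf")
    # Every one of the |constellation|^n_t candidate vectors is visited.
    for cand in product(constellation, repeat=n_t):
        metric = sum(abs(yi - zi) ** 2 for yi, zi in zip(y, matvec(H, cand)))
        if metric < best_metric:
            best, best_metric = list(cand), metric
    return best

# Noise-free 2 x 2 example: the transmitted vector is recovered.
H = [[1 + 1j, 0.5 - 0.2j], [0.3j, 1.2 + 0.1j]]
x = [QPSK[0], QPSK[3]]
y = matvec(H, x)
x_hat = ml_detect(H, y)
```

The loop visits every candidate vector, so the number of metric evaluations grows exponentially with the number of transmit antennas.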



By defining the QRD H = QR of the $n_R \times n_T$ complex
channel matrix, with Q a unitary matrix of dimensions
$n_R \times n_T$ ($\mathbf{Q}^H\mathbf{Q} = \mathbf{I}_{n_T}$) and R an upper triangular matrix of
dimensions $n_T \times n_T$, and by considering $\tilde{\mathbf{y}} = \mathbf{Q}^H\mathbf{y}$, the ML
problem in (2) can be re-written as:

$$\hat{\mathbf{x}}_{ML} = \arg\min_{\mathbf{x} \in \xi^{n_T}} \sum_{i=1}^{n_T} \left| \tilde{y}_i - \sum_{j=i}^{n_T} R_{i,j}\, x_j \right|^2, \qquad (3)$$

where $\tilde{y}_i$ is the i-th entry of $\tilde{\mathbf{y}}$, $x_j$ is a hypothetical value for
the j-th entry of x and $R_{i,j}$ is the (i, j) entry of R. Due to the
QRD, the $n_T$-dimensional lattice search in (3) is transformed
into $n_T$ parallel one-dimensional search problems. Therefore,
instead of searching over a multi-dimensional sphere, the
search is done in parallel over lines, which reduces the
computational complexity.

Since embedded communication systems are limited in 
terms of computational complexity, several detection schemes 
have been proposed in the literature to solve the search 
problem in (3) with low computational costs. 

The QRD-based SIC technique was originally introduced in
[8] and further studied by Wuebben et al. [3], [9]. The decision
process becomes, for all i from $n_T$ down to 1:

$$\hat{x}_i = Q_{\xi}\left[ \frac{\tilde{y}_i - \sum_{j=i+1}^{n_T} R_{i,j}\,\hat{x}_j}{R_{i,i}} \right], \qquad (4)$$

where $Q_{\xi}[\cdot]$ denotes the quantization to the modulation set
$\xi$, and $\hat{x}_i$ and $\hat{x}_j$ are the estimates of the i-th and j-th transmit
symbols respectively, for j > i.
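The SIC decision rule can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the names `slice_symbol` and `sic_detect` are hypothetical, the modulation set is assumed to be unit-energy QPSK, indices are 0-based, and the example is noise-free.

```python
# Unit-energy QPSK constellation (assumption made for this example).
QPSK = [complex(a, b) / 2 ** 0.5 for a in (1, -1) for b in (1, -1)]

def slice_symbol(z, constellation=QPSK):
    """Quantise z to the nearest constellation point."""
    return min(constellation, key=lambda s: abs(z - s))

def sic_detect(R, y_tilde, constellation=QPSK):
    """Back-substitution with slicing, from the last layer to the first."""
    n_t = len(R)
    x_hat = [0j] * n_t
    for i in range(n_t - 1, -1, -1):
        # Interference from the layers j > i that are already detected.
        interference = sum(R[i][j] * x_hat[j] for j in range(i + 1, n_t))
        x_hat[i] = slice_symbol((y_tilde[i] - interference) / R[i][i],
                                constellation)
    return x_hat

# Noise-free example with an upper triangular R.
R = [[2.0, 0.4 - 0.3j, 0.1j], [0, 1.5, -0.2 + 0.5j], [0, 0, 1.1]]
x = [QPSK[1], QPSK[2], QPSK[0]]
y_tilde = [sum(R[i][j] * x[j] for j in range(3)) for i in range(3)]
x_hat = sic_detect(R, y_tilde)
```

Each layer is decided once and never revisited, so a wrong early decision propagates to the remaining layers; this is one way to see why SIC falls short of ML performance.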

Although the SIC schemes outperform the linear schemes, they
are still far from achieving the optimum performance of the
MLD [10] and, in particular, do not reach the same diversity order.

SD and, in particular, QRD-M detection algorithms are
considered prominent schemes that achieve quasi-ML
performance while requiring lower complexity than the
MLD. The main idea of the SD is to restrict the search problem
in (3) to the lattice points inside a hyper-sphere of predefined
radius d. The decision process then takes the sphere
constraint into account for all i from $n_T$ down to 1.

Although SD has a low average computational complexity, its
worst-case complexity is still high. The conventional QRD-M
algorithm has a fixed complexity, which is a desired property
for latency-limited communication systems. The idea of the
QRD-M is to retain a fixed number of symbol candidates at
each search level.
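The candidate-retention idea of the QRD-M can be sketched as a breadth-first tree search. The Python code below is a minimal illustration, not the authors' implementation: the name `qrd_m_detect` is hypothetical, the modulation set is assumed to be unit-energy QPSK, and the accumulated squared Euclidean distance is assumed as the path metric.

```python
# Unit-energy QPSK constellation (assumption made for this example).
QPSK = [complex(a, b) / 2 ** 0.5 for a in (1, -1) for b in (1, -1)]

def qrd_m_detect(R, y_tilde, M, constellation=QPSK):
    """Breadth-first search keeping the M best partial candidates per level."""
    n_t = len(R)
    # Each survivor is (accumulated metric, symbols of layers i..n_t-1).
    survivors = [(0.0, [])]
    for i in range(n_t - 1, -1, -1):
        expanded = []
        for metric, tail in survivors:
            for s in constellation:
                cand = [s] + tail          # cand[k] is the symbol of layer i + k
                z = sum(R[i][i + k] * cand[k] for k in range(len(cand)))
                expanded.append((metric + abs(y_tilde[i] - z) ** 2, cand))
        # Fixed complexity: at most M candidates survive at every level.
        expanded.sort(key=lambda t: t[0])
        survivors = expanded[:M]
    return survivors[0][1]

# Noise-free example with an upper triangular R.
R = [[2.0, 0.4 - 0.3j, 0.1j], [0, 1.5, -0.2 + 0.5j], [0, 0, 1.1]]
x = [QPSK[1], QPSK[2], QPSK[0]]
y_tilde = [sum(R[i][j] * x[j] for j in range(3)) for i in range(3)]
x_hat = qrd_m_detect(R, y_tilde, M=2)
```

With M = 1 the search keeps a single candidate per level, as SIC does, while choosing M large enough that nothing is ever pruned makes it exhaustive; intermediate values trade complexity for performance.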

III. QR-Decomposition Techniques

A. Gram-Schmidt Orthogonalization 

Gram-Schmidt (GS) orthogonalization is frequently used in
communication systems. The GS QRD algorithm consists of
two steps, namely the normalization and orthogonalization steps.
In the normalization step, the current column of the matrix Q
is normalized; in the orthogonalization step,
the remaining columns of Q are orthogonalized to the
obtained column. Note that the matrix Q is initialized to H.
Therefore, the corresponding row of the matrix R is obtained
from Q. Algorithm 1 depicts a MATLAB-like¹ algorithmic
description of the Stable GS (STGS) algorithm, which introduces
a minor update of the Classical GS (CLGS) algorithm.

Algorithm 1 Stable Gram-Schmidt QRD algorithm.
INPUT: H
OUTPUT: Q, R
Initialization: Q ← H, R ← $\mathbf{I}_{n_T}$
for i = 1, ..., $n_T$ do
    $R_{i,i}$ ← $\sqrt{\mathbf{Q}_{:,i}^H \mathbf{Q}_{:,i}}$
    $\mathbf{Q}_{:,i}$ ← $\mathbf{Q}_{:,i} / R_{i,i}$
    for j = i + 1, ..., $n_T$ do
        $R_{i,j}$ ← $\mathbf{Q}_{:,i}^H \mathbf{Q}_{:,j}$
        $\mathbf{Q}_{:,j}$ ← $\mathbf{Q}_{:,j} - R_{i,j}\,\mathbf{Q}_{:,i}$
    end for
end for
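The two steps described above can be sketched in plain Python. This is a minimal illustration of a stable (modified) Gram-Schmidt factorization, not the authors' code: the function name `stable_gram_schmidt` is hypothetical, matrices are lists of rows, R is initialized to zeros rather than to the identity (the final result is the same), and H is assumed to have full column rank.

```python
def stable_gram_schmidt(H):
    """Return (Q, R) with H = QR; H is a list of rows of complex numbers."""
    n_r, n_t = len(H), len(H[0])
    Q = [list(map(complex, row)) for row in H]      # Q is initialised to H
    R = [[0j] * n_t for _ in range(n_t)]
    for i in range(n_t):
        # Normalisation step: R[i][i] is the norm of the current column i.
        norm = sum(abs(Q[k][i]) ** 2 for k in range(n_r)) ** 0.5
        R[i][i] = complex(norm)
        for k in range(n_r):
            Q[k][i] /= norm
        # Orthogonalisation step: remove the component along column i from
        # every remaining column, using the already-updated columns.
        for j in range(i + 1, n_t):
            R[i][j] = sum(Q[k][i].conjugate() * Q[k][j] for k in range(n_r))
            for k in range(n_r):
                Q[k][j] -= R[i][j] * Q[k][i]
    return Q, R

H = [[1 + 2j, 0.5 - 1j], [0.3j, 2 + 0j], [1 - 1j, -0.7 + 0.2j]]
Q, R = stable_gram_schmidt(H)
```

The difference with the classical variant is that each projection coefficient is computed from the partially orthogonalized columns instead of the original columns of H, which limits the loss of orthogonality in finite precision.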

B. Householder Reflections 

Householder (HH) reflections constitute another classical QRD
technique, used to obtain the upper triangular matrix R,
from which the matrix Q can be obtained if required. The
idea behind the HH technique is to obtain the matrix R using
a reflection matrix. This reflection matrix, also known as the
Householder matrix, is used to cancel all the elements of a
vector except its first element, which is assigned the norm
of the vector. Therefore, the columns of the matrix H are
treated iteratively to obtain the R matrix. Algorithm 2 depicts
a MATLAB-like algorithmic description of the HH technique.
The diag{·} function returns a matrix with the input vector on
the main diagonal and the triu{·} function extracts the upper
triangular part of an input matrix, excluding the main diagonal.
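For reference, one common textbook construction of the reflection is recalled below. It is given as an assumption about the form of the Householder matrix, not as the exact convention of Algorithm 2; the symbols a, v, P and $\varphi$ are introduced here for illustration.

```latex
\begin{align}
\mathbf{v} &= \mathbf{a} + e^{j\varphi}\,\|\mathbf{a}\|\,\mathbf{e}_1,
  \qquad \varphi = \arg(a_1), \\
\mathbf{P} &= \mathbf{I} - 2\,\frac{\mathbf{v}\mathbf{v}^{H}}{\mathbf{v}^{H}\mathbf{v}}, \\
\mathbf{P}\mathbf{a} &= -e^{j\varphi}\,\|\mathbf{a}\|\,\mathbf{e}_1 .
\end{align}
```

Here a is the part of the current column on and below the diagonal and $\mathbf{e}_1$ is the first canonical basis vector. The last line follows from $2\,\mathbf{v}^{H}\mathbf{a} = \mathbf{v}^{H}\mathbf{v}$, so that $\mathbf{P}\mathbf{a} = \mathbf{a} - \mathbf{v}$. With this sign choice the first element has the magnitude of the norm of a, up to a unit-modulus factor, and all the other elements are cancelled.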

C. Givens Rotations 

Givens Rotations (GR) can also be employed to factorize 
the matrix H [11]. This technique is usually used in embedded 
systems because of its numerical stability [12]. The general 
principle of this technique is to cancel the elements of the 
matrix H so that a triangular form is obtained. Therefore, the GR
technique is the only technique which is not iterative and,
hence, it has parallelism capabilities that will be explored later
on.

Algorithm 3 depicts a MATLAB-like algorithmic description
of the GR technique, where the rotation matrix $\boldsymbol{\Theta}$ is
calculated such that the elements of H are cancelled column by
column, and an upper triangular matrix R is finally obtained.

¹Conventional MATLAB notations are employed in the introduced
algorithms, i.e. $A_{n,n}$ corresponds to the (n, n)-th entry of the matrix A, $A_{:,n}$
corresponds to the n-th column vector of A and $A_{n:N,n}$ corresponds to the
entries from n to N of the n-th column vector of A.

Algorithm 2 Householder triangulation technique.
INPUT: H
OUTPUT: Q, R
Initialization: Q ← H, R ← $\mathbf{I}_{n_T}$
for i = 1, ..., $n_T$ do
    [v, β] ← house($\mathbf{Q}_{i:n_R,i}$)
    ...
end for
for i = $n_T$, ..., 1 do
    ...

Algorithm 3 Givens rotations algorithm.
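The column-by-column cancellation can be sketched in Python as follows. This is a hedged illustration and not a transcription of Algorithm 3: the name `givens_qr` is hypothetical, the rotation is one standard complex Givens form, and each entry is assumed to be cancelled against the diagonal entry of its column.

```python
def givens_qr(H):
    """Triangularise H with Givens rotations; return (Q, R) with H = QR."""
    n_r, n_t = len(H), len(H[0])
    R = [list(map(complex, row)) for row in H]
    # QH accumulates the product of all rotations, i.e. Q^H.
    QH = [[complex(r == c) for c in range(n_r)] for r in range(n_r)]
    for j in range(n_t):                    # column by column
        for i in range(j + 1, n_r):         # cancel R[i][j] against R[j][j]
            a, b = R[j][j], R[i][j]
            r = (abs(a) ** 2 + abs(b) ** 2) ** 0.5
            if r == 0:
                continue                    # nothing to cancel
            c, s = a / r, b / r
            # Apply the 2 x 2 unitary rotation to rows j and i.
            for M in (R, QH):
                for k in range(len(M[0])):
                    top, bot = M[j][k], M[i][k]
                    M[j][k] = c.conjugate() * top + s.conjugate() * bot
                    M[i][k] = -s * top + c * bot
    Q = [[QH[c][r].conjugate() for c in range(n_r)] for r in range(n_r)]
    return Q, R

H = [[1 + 2j, 0.5 - 1j], [0.3j, 2 + 0j], [1 - 1j, -0.7 + 0.2j]]
Q, R = givens_qr(H)
```

In this sketch Q is the full $n_R \times n_R$ unitary factor and R has $n_R$ rows, the last $n_R - n_T$ of which are zero. Each rotation only touches two rows, which is the property the parallel variants exploit.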

D. Parallel Givens Rotations 

The Parallel GR (PGR) technique allows independent operations
in the QR decomposition process so that the processing
speed is increased. An example of the PGR is explained in the
case of a 4 × 4 real matrix. The conventional GR technique
obtains the following matrix:

$$\mathbf{R} = \begin{bmatrix}
r_{1,1} & r_{2,1} & r_{3,1} & r_{4,1} \\
0_1 & r_{2,2} & r_{3,2} & r_{4,2} \\
0_2 & 0_4 & r_{3,3} & r_{4,3} \\
0_3 & 0_5 & 0_6 & r_{4,4}
\end{bmatrix}, \qquad (6)$$

where the entries $r_{i,j}$ represent the elements of H that are
modified during the QRD. The notation $0_k$ represents the order
k in which the corresponding elements of the matrix H are
cancelled out, with $\{r_{i,j} \mid 1 \leq i, j \leq n_T\} \subset \mathbb{C}$ and
$\{r_{i,i} \mid 1 \leq i \leq n_T\} \subset \mathbb{R}$. Hence, we remark that 6 sequential
and dependent steps are necessary for the cancellation stage
for a 4 × 4 real matrix.

Equation (7) illustrates the PGR technique to obtain the
triangular matrix R from H:

$$\mathbf{R} = \begin{bmatrix}
r_{1,1} & r_{2,1} & r_{3,1} & r_{4,1} \\
0_1 & r_{2,2} & r_{3,2} & r_{4,2} \\
0_2 & 0_3 & r_{3,3} & r_{4,3} \\
0_3 & 0_4 & 0_5 & r_{4,4}
\end{bmatrix}, \qquad (7)$$

where 2 elements are simultaneously cancelled at step
3 due to their independence. Hence, a parallelization gain
is obtained, which increases with the size of the matrix H.
Fig. 1, which consists of 8 pipes, shows the advantages of
the PGR. The white blocks represent an idle pipe while the
dark blocks represent a pipe performing a calculation. This
calculation may correspond to the calculation of the rotation
matrix, to the cancellation of the imaginary part of an element
of R, to the cancellation of the real and imaginary parts of an
element of R, or to the calculation of Q. The potential parallelism
has thus been highlighted and can be associated with the QRD
complexity reduction.
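The cancellation order of the 4 × 4 example can be generated programmatically. The Python sketch below rests on an assumption: the entry at row i and column j (1-based, i > j) is cancelled at step i + j - 2 by a rotation of rows j and i. This rule reproduces the order of the example above, but it is one possible ordering and is not stated in the text; the name `pgr_schedule` is hypothetical.

```python
def pgr_schedule(n):
    """Group the sub-diagonal entries (row i, column j) of an n x n matrix
    into time steps; entries sharing a step use disjoint pairs of rows."""
    steps = {}
    for j in range(1, n):
        for i in range(j + 1, n + 1):
            steps.setdefault(i + j - 2, []).append((i, j))
    return [steps[s] for s in sorted(steps)]

schedule = pgr_schedule(4)
sequential_steps = sum(len(group) for group in schedule)  # one entry per step
parallel_steps = len(schedule)                            # grouped entries
```

For the 4 × 4 case this gives 5 parallel steps instead of 6 sequential ones, with two entries cancelled together at step 3. Under this rule the number of steps grows as 2n - 3 while the number of entries grows as n(n - 1)/2, which illustrates why the gain increases with the matrix size.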

Fig. 1: Parallelism illustration with a QRD process on a 4 × 4
matrix made of real entries (axes: pipes and time).



E. Reduced Complexity of Parallel Givens Rotations 

The main idea of the Reduced Complexity PGR (RCPGR)
technique is to perform the computational operations in
parallel, for both the real and imaginary parts of H, without
explicitly computing the matrix Q.



Lemma 1: Consider a modified $n_R \times (n_T + 1)$ input matrix
constructed as $[\mathbf{H}\ \mathbf{y}]$, with $n_R \geq n_T$. The QRD process on
the $n_T$ first rows results in an $n_T \times (n_T + 1)$ triangular matrix
corresponding to $[\mathbf{R}\ \tilde{\mathbf{y}}]$, with $\tilde{\mathbf{y}} = \mathbf{Q}^H\mathbf{y}$.

Proof: Let us consider the initialization step of the QRD
algorithm by introducing the $n_R \times (n_T + 1)$ matrix $\tilde{\mathbf{Q}} = [\mathbf{H}\ \mathbf{y}]$
and the $(n_T + 1) \times (n_T + 1)$ matrix $\tilde{\mathbf{R}} = \mathbf{I}_{n_T+1}$.
For the sake of simplicity, let us consider the common
GS technique. By processing the QRD from columns 1 to
$n_T$, the following results are obtained: $\tilde{\mathbf{Q}} = [\mathbf{Q}\ \mathbf{y}]$ and [...]
The final step of the QRD processing is realized [...] column.
By considering the GS technique,
the $\tilde{\mathbf{R}}$ coefficients are obtained with the expression $\tilde{R}_{i,j}$ = [...]
algorithm process.

It has been previously shown that $\tilde{\mathbf{Q}}_{:,i} = \mathbf{Q}_{:,i}$, $\forall i \in [1, n_T]$.

Consequently, $\tilde{R}_{i,n_T+1} = \mathbf{Q}_{:,i}^H \mathbf{y}$ and $\tilde{\mathbf{R}}_{1:n_T,1:n_T+1} = [\mathbf{R}\ \tilde{\mathbf{y}}]$.

Algorithm 4 Reduced complexity Givens rotations algorithm.



Lemma 1 implies that the inputs of the detection stage, i.e. $\tilde{\mathbf{y}}$ and R,
are obtained without explicitly calculating the unitary matrix
Q. This technique is depicted in Algorithm 4 in a MATLAB-like
description. This considerably reduces the computational
complexity, as will be seen in the following section.
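The consequence of Lemma 1 can be checked numerically with a short Python sketch. It is an illustration under stated assumptions and not Algorithm 4 itself: the rotations are applied sequentially rather than in parallel, the name `givens_triangularise` is hypothetical, and the example is noise-free so that the expected result is known.

```python
def givens_triangularise(A, n_cols):
    """Apply Givens rotations in place so that the first n_cols columns of A
    become upper triangular; any extra columns are rotated along."""
    n_rows = len(A)
    for j in range(n_cols):
        for i in range(j + 1, n_rows):
            a, b = A[j][j], A[i][j]
            r = (abs(a) ** 2 + abs(b) ** 2) ** 0.5
            if r == 0:
                continue
            c, s = a / r, b / r
            for k in range(len(A[0])):
                top, bot = A[j][k], A[i][k]
                A[j][k] = c.conjugate() * top + s.conjugate() * bot
                A[i][k] = -s * top + c * bot
    return A

H = [[1 + 2j, 0.5 - 1j], [0.3j, 2 + 0j], [1 - 1j, -0.7 + 0.2j]]
x = [1 - 1j, -1 + 1j]
y = [sum(h * xj for h, xj in zip(row, x)) for row in H]   # noise-free y = Hx

# Augmented matrix [H y]: y is appended as an extra column.
aug = [list(map(complex, row)) + [complex(yi)] for row, yi in zip(H, y)]
givens_triangularise(aug, n_cols=2)
R = [row[:2] for row in aug[:2]]          # upper triangular factor
y_tilde = [row[2] for row in aug[:2]]     # filtered vector, Q never formed
```

In the noise-free case the filtered vector equals Rx, which the rotated last column reproduces without any storage or multiplication for Q.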

IV. Simulation Results 

A. Computational Complexities 

Computational complexities² of the previously described
QRD techniques have been theoretically obtained and are
displayed in Table I. Herein, we consider the computational
complexity of decomposing an $n_R \times n_T$ complex matrix.

The corresponding computational complexities are plotted
in Figure 2; the computation of the received vector $\tilde{\mathbf{y}}$ is also
taken into account, which is not the case in general. These
results roughly match the classical theoretical
complexities [13], but they are more realistic for a Digital
Signal Processor (DSP) implementation. GR is more
complex in terms of the required number of operations
than STGS: for an 8 × 8 complex matrix, the additional
complexity is 164%. This difference is due to the need for
explicitly computing the Q matrix. Also, results show that
RCPGR achieves a computational complexity reduction of
39% and 77% compared to STGS and GR, respectively,
for an 8 × 8 complex matrix.

²The assumptions are the following: a real product is denoted 1 MUL, a real
addition as [...] MUL (Multiply Accumulate (MAC) operation), a real division
as 16 MUL (conditional add-subtract algorithm) and a real square root as 32
MUL (dynamic number between [...] and 1 and Taylor series development).

TABLE I: Complexities equivalences (rows include GR, PGR and RCPGR).

Fig. 2: Computational complexities of the summarized techniques
as a function of the size of the complex matrix to decompose.

The QRD techniques whose computational complexities have been
presented above offer strictly the same error performance. This
point is verified in Subsection IV-C with LTE-A
parameters.



B. Algorithmic Parallelism 

Let us define the parallelism gain as the ratio of the number
of parallel operations to the total number of operations. The
parallelism gain of the summarized QRD algorithms is plotted
in Figure 3 for the reference STGS-based QRD and the
proposed RCPGR-based QRD, as a function of the size of the
decomposed complex matrix; the size of the channel matrix
ranges from 2 × 2 to 8 × 8, which are the available modes in
the LTE-A standard.

The proposed RCPGR-based QRD is more complex, in
terms of computational complexity, than all the classical
QRD techniques. However, thanks to the parallelism potential of
the algorithm, the proposed technique is the best in terms of
execution time, which is an essential point for implementation
considerations. The error performance is shown to be strictly the same
in Section IV-C.



Fig. 3: Algorithmic parallelism (parallelism gain of the STGS and
RCPGR techniques) as a function of the size of the complex matrix
to decompose.



C. Performances 

Performance results in terms of uncoded BER are given in
this part of the paper, for both the previously described SIC
and QRD-M techniques. The channel matrix is considered to
be perfectly known at the receiver. In Figure 4, the RCPGR
QRD-based SIC and QRD-M (M from 2 to 4) performance
is plotted as a function of the Signal-to-Noise Ratio (SNR)
for an 8 × 8 MIMO system, under a complex Rayleigh fading
channel.

Fig. 4: QRD-based SD performances. Uncoded BER of the reference
ZF, MMSE, SIC and ML and RCPGR-based QRD-2/3/4 SD as a function of
the SNR, for a MIMO 4 × 4 Rayleigh channel and QPSK
modulation on each layer.

The proposed RCPGR QRD-based SD performance is
classically compared to the reference Zero Forcing (ZF)
and ML. Note that in terms of error performance, all the
introduced QRD techniques lead to similar performance since
ordering was not included. The ZF equalizer is considered
since it provides an upper bound on the error rate at the lowest
possible computational complexity. The ML detector is also
considered since it provides a lower bound on the error rate, but
at an exponential computational complexity. The ML detector is
practically infeasible in the LTE-A context. For example,
by considering a QPSK modulation on each transmit antenna
and an 8 × 8 system configuration, $4^8 = 65536$ Euclidean
distance computations are necessary to decode 16 bits only.
The QRD-M performance has been shown to be close to
that of the optimum ML, for a polynomial mean complexity
and with a parallel preprocessing using the RCPGR algorithm.

V. Conclusions 

In this paper, several QR decomposition techniques have
been addressed. The computational complexities of these
algorithms as well as their methodology have been investigated.
Attention has also been paid to the parallelism
capabilities of the channel factorization, where the parallelization
gain of the Givens rotations technique, i.e., RCPGR, has
been outlined. The performance of the detection algorithms
can be improved by using signal ordering, which has been the topic
of several studies in the literature.